Introduction
Pairs trading is a market-neutral strategy: it identifies two closely correlated financial instruments, such as stocks, and trades on their relative price movements rather than on the direction of the market. In this project, two pairs of shares are selected according to their correlations, and a pairs trading strategy is applied to those shares.
In the first part of the project, the price difference between the two stocks in a pair is fitted with a regression model, and pairs trading is applied according to a control chart constructed from the residuals of that model. When we plotted the data, the residuals appeared to be distributed with constant variance, and we have built the study on this assumption.
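Concretely, the decision rule can be sketched as follows. This is a minimal illustration of the 3-sigma control-chart logic described above, not the exact code used later; the `pair_signals` helper and its thresholds are hypothetical:

```python
import numpy as np
import pandas as pd

def pair_signals(spread: pd.Series, k: float = 3.0) -> pd.Series:
    """Label each observation of the price spread as a trade signal.

    +1 -> spread unusually low: buy the first leg, sell the second
    -1 -> spread unusually high: sell the first leg, buy the second
     0 -> spread inside the k-sigma control limits: no trade
    """
    mean, std = spread.mean(), spread.std()
    upper, lower = mean + k * std, mean - k * std
    return pd.Series(
        np.where(spread > upper, -1, np.where(spread < lower, 1, 0)),
        index=spread.index,
    )

# Toy example: a roughly mean-zero spread with one extreme point
spread = pd.Series([0.1, -0.2, 0.0, 5.0, 0.2])
print(pair_signals(spread, k=1.0).tolist())  # → [0, 0, 0, -1, 0]
```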
!pip install statsmodels
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
ChatGPT prompt: how to aggregate 4 different csv files into 1 ?
data1 = pd.read_csv("20221226_20230326_bist30.csv")
data2 = pd.read_csv("20230327_20230625_bist30.csv")
data3 = pd.read_csv("20230626_20230924_bist30.csv")
data4 = pd.read_csv("20230925_20231224_bist30.csv")
# The four quarterly files are already DataFrames; concatenate them into one
data1 = pd.concat([data1, data2, data3, data4], ignore_index=True)
data1.head()
| timestamp | price | short_name | |
|---|---|---|---|
| 0 | 2022-12-26 09:00:00+03:00 | 143.3 | THYAO |
| 1 | 2022-12-26 10:00:00+03:00 | 145.3 | THYAO |
| 2 | 2022-12-26 11:00:00+03:00 | 144.1 | THYAO |
| 3 | 2022-12-26 12:00:00+03:00 | 144.2 | THYAO |
| 4 | 2022-12-26 13:00:00+03:00 | 144.3 | THYAO |
data1.dtypes
timestamp      object
price         float64
short_name     object
dtype: object
data1["timestamp"] = pd.to_datetime(data1["timestamp"])
companies = data1["short_name"].unique()
companies
array(['THYAO', 'AKBNK', 'ARCLK', 'ASELS', 'BIMAS', 'DOHOL', 'EKGYO',
'EREGL', 'FROTO', 'GUBRF', 'GARAN', 'KRDMD', 'KCHOL', 'KOZAL',
'KOZAA', 'PGSUS', 'PETKM', 'SAHOL', 'SASA', 'SISE', 'TAVHL',
'TKFEN', 'TUPRS', 'TTKOM', 'TCELL', 'HALKB', 'ISCTR', 'VAKBN',
'VESTL', 'YKBNK'], dtype=object)
def separate_data(data):
    """Split the combined DataFrame into one DataFrame per company."""
    return [data[data["short_name"] == name] for name in data["short_name"].unique()]
df_list = separate_data(data1)
df_list[0]
| timestamp | price | short_name | |
|---|---|---|---|
| 0 | 2022-12-26 09:00:00+03:00 | 143.30 | THYAO |
| 1 | 2022-12-26 10:00:00+03:00 | 145.30 | THYAO |
| 2 | 2022-12-26 11:00:00+03:00 | 144.10 | THYAO |
| 3 | 2022-12-26 12:00:00+03:00 | 144.20 | THYAO |
| 4 | 2022-12-26 13:00:00+03:00 | 144.30 | THYAO |
| ... | ... | ... | ... |
| 54947 | 2023-11-21 09:00:00+03:00 | 256.00 | THYAO |
| 54948 | 2023-11-21 10:00:00+03:00 | 255.00 | THYAO |
| 54949 | 2023-11-21 11:00:00+03:00 | 257.75 | THYAO |
| 54950 | 2023-11-21 12:00:00+03:00 | 258.25 | THYAO |
| 54951 | 2023-11-21 13:00:00+03:00 | 258.25 | THYAO |
2233 rows × 3 columns
for i in range(len(df_list)):
    df_list[i].plot(title=companies[i], x="timestamp", y="price", figsize=(20, 6))
    plt.show()
    plt.close("all")  # avoid keeping 30 open figures in memory
ChatGPT prompt: I have a list of DataFrames that share the same columns. Write Python code to construct the correlation matrix.
import pandas as pd
def calculate_price_correlation(dataframes):
    # Merge all dataframes on 'timestamp', keeping only the 'price' column from each.
    # Note: relies on the global `companies` array for the renamed column names.
merged_df = None
for i, df in enumerate(dataframes):
# Renaming the price column to include the dataframe identifier
df_renamed = df.rename(columns={'price': companies[i]})
if merged_df is None:
merged_df = df_renamed
else:
# Merging on 'timestamp'
merged_df = pd.merge(merged_df, df_renamed[['timestamp', companies[i]]], on='timestamp', how='inner')
# Calculate the correlation matrix for the price columns, explicitly stating numeric_only=True
correlation_matrix = merged_df.corr(numeric_only=True)
return correlation_matrix, merged_df
# Example usage:
# dfs = [df1, df2, df3] # Your list of dataframes
# correlation_matrix, merged_df = calculate_price_correlation(dfs)
correlation_matrix, merged_df = calculate_price_correlation(df_list)
correlation_matrix
| THYAO | AKBNK | ARCLK | ASELS | BIMAS | DOHOL | EKGYO | EREGL | FROTO | GUBRF | ... | TAVHL | TKFEN | TUPRS | TTKOM | TCELL | HALKB | ISCTR | VAKBN | VESTL | YKBNK | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| THYAO | 1.000000 | 0.927990 | 0.866390 | 0.865075 | 0.876044 | 0.909923 | 0.584238 | 0.485534 | 0.952126 | 0.840671 | ... | 0.944557 | 0.756897 | 0.821960 | 0.642405 | 0.928303 | 0.808102 | 0.805504 | 0.780396 | 0.479559 | 0.896445 |
| AKBNK | 0.927990 | 1.000000 | 0.896962 | 0.921354 | 0.969132 | 0.963283 | 0.440232 | 0.475807 | 0.919603 | 0.852034 | ... | 0.955705 | 0.792686 | 0.926268 | 0.589922 | 0.946678 | 0.847779 | 0.943880 | 0.901830 | 0.484287 | 0.976499 |
| ARCLK | 0.866390 | 0.896962 | 1.000000 | 0.890969 | 0.824788 | 0.909530 | 0.608780 | 0.658467 | 0.886032 | 0.753549 | ... | 0.915285 | 0.836240 | 0.829604 | 0.694350 | 0.860706 | 0.808142 | 0.836013 | 0.779591 | 0.615708 | 0.858877 |
| ASELS | 0.865075 | 0.921354 | 0.890969 | 1.000000 | 0.890778 | 0.949158 | 0.548941 | 0.623616 | 0.824047 | 0.876648 | ... | 0.936430 | 0.865535 | 0.910466 | 0.728698 | 0.926500 | 0.774860 | 0.863003 | 0.921890 | 0.710938 | 0.876496 |
| BIMAS | 0.876044 | 0.969132 | 0.824788 | 0.890778 | 1.000000 | 0.923392 | 0.290123 | 0.387121 | 0.876105 | 0.825977 | ... | 0.904318 | 0.703699 | 0.949890 | 0.485523 | 0.911954 | 0.777245 | 0.946244 | 0.918410 | 0.392410 | 0.969189 |
| DOHOL | 0.909923 | 0.963283 | 0.909530 | 0.949158 | 0.923392 | 1.000000 | 0.541357 | 0.554893 | 0.886641 | 0.871338 | ... | 0.968754 | 0.846989 | 0.903814 | 0.693496 | 0.949857 | 0.827678 | 0.897552 | 0.920870 | 0.616914 | 0.938717 |
| EKGYO | 0.584238 | 0.440232 | 0.608780 | 0.548941 | 0.290123 | 0.541357 | 1.000000 | 0.783809 | 0.459013 | 0.551574 | ... | 0.595083 | 0.734612 | 0.323870 | 0.792803 | 0.561368 | 0.604534 | 0.313360 | 0.375774 | 0.752757 | 0.362189 |
| EREGL | 0.485534 | 0.475807 | 0.658467 | 0.623616 | 0.387121 | 0.554893 | 0.783809 | 1.000000 | 0.386585 | 0.465682 | ... | 0.563725 | 0.721908 | 0.520701 | 0.737080 | 0.544204 | 0.533726 | 0.469383 | 0.506496 | 0.786236 | 0.409990 |
| FROTO | 0.952126 | 0.919603 | 0.886032 | 0.824047 | 0.876105 | 0.886641 | 0.459013 | 0.386585 | 1.000000 | 0.738193 | ... | 0.900130 | 0.671257 | 0.793044 | 0.512163 | 0.871003 | 0.775621 | 0.802492 | 0.737896 | 0.342505 | 0.896897 |
| GUBRF | 0.840671 | 0.852034 | 0.753549 | 0.876648 | 0.825977 | 0.871338 | 0.551574 | 0.465682 | 0.738193 | 1.000000 | ... | 0.891405 | 0.847256 | 0.825257 | 0.690217 | 0.876429 | 0.797647 | 0.808500 | 0.865353 | 0.659768 | 0.832366 |
| GARAN | 0.936405 | 0.982634 | 0.855309 | 0.875128 | 0.952575 | 0.940310 | 0.434799 | 0.431791 | 0.915027 | 0.826220 | ... | 0.935854 | 0.744842 | 0.895678 | 0.575474 | 0.952279 | 0.831733 | 0.911510 | 0.856370 | 0.419748 | 0.958100 |
| KRDMD | 0.850504 | 0.891221 | 0.917298 | 0.876176 | 0.852987 | 0.884422 | 0.566729 | 0.699587 | 0.839076 | 0.776296 | ... | 0.897717 | 0.821778 | 0.878146 | 0.626763 | 0.826950 | 0.839683 | 0.879593 | 0.836341 | 0.584102 | 0.882022 |
| KCHOL | 0.948396 | 0.985123 | 0.882573 | 0.897888 | 0.969369 | 0.952369 | 0.431138 | 0.446477 | 0.941949 | 0.853293 | ... | 0.950463 | 0.761246 | 0.919686 | 0.567965 | 0.939899 | 0.836689 | 0.929791 | 0.887091 | 0.443547 | 0.977934 |
| KOZAL | 0.597493 | 0.517969 | 0.636085 | 0.643503 | 0.392027 | 0.593266 | 0.855182 | 0.744061 | 0.477452 | 0.632341 | ... | 0.665582 | 0.827686 | 0.436420 | 0.866765 | 0.633225 | 0.661240 | 0.438018 | 0.483373 | 0.805992 | 0.442394 |
| KOZAA | 0.808950 | 0.793641 | 0.813552 | 0.849649 | 0.719037 | 0.820617 | 0.728171 | 0.677754 | 0.713087 | 0.870559 | ... | 0.873566 | 0.922447 | 0.739816 | 0.824139 | 0.844488 | 0.833473 | 0.745065 | 0.769079 | 0.770427 | 0.747953 |
| PGSUS | 0.973862 | 0.916666 | 0.897048 | 0.859732 | 0.840604 | 0.903514 | 0.622864 | 0.548019 | 0.941508 | 0.791291 | ... | 0.932460 | 0.764270 | 0.796956 | 0.674595 | 0.930326 | 0.784223 | 0.783077 | 0.736821 | 0.505409 | 0.858983 |
| PETKM | 0.706113 | 0.785155 | 0.762838 | 0.896137 | 0.766124 | 0.852132 | 0.610603 | 0.721469 | 0.601539 | 0.834191 | ... | 0.848453 | 0.879486 | 0.817417 | 0.758099 | 0.817684 | 0.716367 | 0.768161 | 0.899469 | 0.837843 | 0.762059 |
| SAHOL | 0.961033 | 0.973862 | 0.919293 | 0.915937 | 0.924503 | 0.960788 | 0.560149 | 0.549324 | 0.938162 | 0.836324 | ... | 0.958283 | 0.802812 | 0.887315 | 0.665166 | 0.963797 | 0.835493 | 0.885209 | 0.849012 | 0.526301 | 0.935367 |
| SASA | 0.375087 | 0.120469 | 0.247342 | 0.073010 | -0.011738 | 0.139850 | 0.502159 | 0.116831 | 0.394778 | 0.127715 | ... | 0.223923 | 0.156974 | -0.097911 | 0.207670 | 0.138371 | 0.307441 | -0.046961 | -0.092789 | 0.034113 | 0.106580 |
| SISE | 0.848253 | 0.872174 | 0.962571 | 0.884939 | 0.799592 | 0.882810 | 0.655171 | 0.742328 | 0.855861 | 0.726750 | ... | 0.877271 | 0.831436 | 0.822775 | 0.689290 | 0.851632 | 0.823838 | 0.824597 | 0.764601 | 0.613955 | 0.825362 |
| TAVHL | 0.944557 | 0.955705 | 0.915285 | 0.936430 | 0.904318 | 0.968754 | 0.595083 | 0.563725 | 0.900130 | 0.891405 | ... | 1.000000 | 0.871581 | 0.877520 | 0.733011 | 0.948184 | 0.851303 | 0.873230 | 0.885766 | 0.637952 | 0.933745 |
| TKFEN | 0.756897 | 0.792686 | 0.836240 | 0.865535 | 0.703699 | 0.846989 | 0.734612 | 0.721908 | 0.671257 | 0.847256 | ... | 0.871581 | 1.000000 | 0.736432 | 0.855586 | 0.820159 | 0.855116 | 0.765421 | 0.790444 | 0.843598 | 0.755315 |
| TUPRS | 0.821960 | 0.926268 | 0.829604 | 0.910466 | 0.949890 | 0.903814 | 0.323870 | 0.520701 | 0.793044 | 0.825257 | ... | 0.877520 | 0.736432 | 1.000000 | 0.561396 | 0.897326 | 0.737115 | 0.949840 | 0.920275 | 0.498160 | 0.913698 |
| TTKOM | 0.642405 | 0.589922 | 0.694350 | 0.728698 | 0.485523 | 0.693496 | 0.792803 | 0.737080 | 0.512163 | 0.690217 | ... | 0.733011 | 0.855586 | 0.561396 | 1.000000 | 0.731267 | 0.620413 | 0.528952 | 0.586779 | 0.851931 | 0.529287 |
| TCELL | 0.928303 | 0.946678 | 0.860706 | 0.926500 | 0.911954 | 0.949857 | 0.561368 | 0.544204 | 0.871003 | 0.876429 | ... | 0.948184 | 0.820159 | 0.897326 | 0.731267 | 1.000000 | 0.804693 | 0.865204 | 0.861951 | 0.575789 | 0.895108 |
| HALKB | 0.808102 | 0.847779 | 0.808142 | 0.774860 | 0.777245 | 0.827678 | 0.604534 | 0.533726 | 0.775621 | 0.797647 | ... | 0.851303 | 0.855116 | 0.737115 | 0.620413 | 0.804693 | 1.000000 | 0.832923 | 0.774999 | 0.526239 | 0.851982 |
| ISCTR | 0.805504 | 0.943880 | 0.836013 | 0.863003 | 0.946244 | 0.897552 | 0.313360 | 0.469383 | 0.802492 | 0.808500 | ... | 0.873230 | 0.765421 | 0.949840 | 0.528952 | 0.865204 | 0.832923 | 1.000000 | 0.906079 | 0.445095 | 0.954393 |
| VAKBN | 0.780396 | 0.901830 | 0.779591 | 0.921890 | 0.918410 | 0.920870 | 0.375774 | 0.506496 | 0.737896 | 0.865353 | ... | 0.885766 | 0.790444 | 0.920275 | 0.586779 | 0.861951 | 0.774999 | 0.906079 | 1.000000 | 0.617193 | 0.916388 |
| VESTL | 0.479559 | 0.484287 | 0.615708 | 0.710938 | 0.392410 | 0.616914 | 0.752757 | 0.786236 | 0.342505 | 0.659768 | ... | 0.637952 | 0.843598 | 0.498160 | 0.851931 | 0.575789 | 0.526239 | 0.445095 | 0.617193 | 1.000000 | 0.429960 |
| YKBNK | 0.896445 | 0.976499 | 0.858877 | 0.876496 | 0.969189 | 0.938717 | 0.362189 | 0.409990 | 0.896897 | 0.832366 | ... | 0.933745 | 0.755315 | 0.913698 | 0.529287 | 0.895108 | 0.851982 | 0.954393 | 0.916388 | 0.429960 | 1.000000 |
30 rows × 30 columns
# plotting correlation heatmap
plt.figure(figsize=(16, 16))
sns.heatmap(correlation_matrix, cmap=plt.cm.CMRmap_r, annot=True)
plt.show()
The following code is adapted from the data scientist and YouTuber Krish Naik.
def correlation(corr_matrix, threshold):
    col_corr = []  # list of [col_i, col_j] pairs with high correlation
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:  # we are interested in the absolute coefficient
                colname1 = corr_matrix.columns[i]
                colname2 = corr_matrix.columns[j]
                col_corr.append([colname1, colname2])
    return col_corr
corr_comps = correlation(correlation_matrix, 0.98)
corr_comps
[['GARAN', 'AKBNK'], ['KCHOL', 'AKBNK']]
The highest-correlation pairs are [['GARAN', 'AKBNK'], ['KCHOL', 'AKBNK']], each with a correlation above the 0.98 threshold.
data1[data1["short_name"]=="GARAN"].plot(x="timestamp", y="price")
<Axes: xlabel='timestamp'>
data1[data1["short_name"]=="AKBNK"].plot(x="timestamp", y="price")
<Axes: xlabel='timestamp'>
diff_df1 = merged_df[["timestamp", "GARAN", "AKBNK"]]
diff_df1
| timestamp | GARAN | AKBNK | |
|---|---|---|---|
| 0 | 2022-12-26 09:00:00+03:00 | 28.0022 | 18.1629 |
| 1 | 2022-12-26 10:00:00+03:00 | 28.2995 | 18.1992 |
| 2 | 2022-12-26 11:00:00+03:00 | 28.2251 | 18.1629 |
| 3 | 2022-12-26 12:00:00+03:00 | 28.2995 | 18.1811 |
| 4 | 2022-12-26 13:00:00+03:00 | 28.2995 | 18.1448 |
| ... | ... | ... | ... |
| 2225 | 2023-11-21 09:00:00+03:00 | 49.0000 | 30.6800 |
| 2226 | 2023-11-21 10:00:00+03:00 | 49.0600 | 30.4800 |
| 2227 | 2023-11-21 11:00:00+03:00 | 48.7600 | 30.4800 |
| 2228 | 2023-11-21 12:00:00+03:00 | 48.6800 | 30.5000 |
| 2229 | 2023-11-21 13:00:00+03:00 | 48.5800 | 30.4600 |
2230 rows × 3 columns
diff_df1 = diff_df1.copy()  # work on a copy to avoid SettingWithCopyWarning
diff_df1["GARAN-AKBNK"] = np.log1p(diff_df1["GARAN"] - diff_df1["AKBNK"])
diff_df1
ChatGPT prompt: Write code to plot the ACF and PACF.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# plot_acf/plot_pacf create their own figures, so no extra plt.figure() is needed
plot_acf(diff_df1["GARAN-AKBNK"], lags=365, alpha=0.05)  # number of lags and confidence level
plt.title('Autocorrelation Function')
plot_pacf(diff_df1["GARAN-AKBNK"], lags=365, alpha=0.05)
plt.title('Partial Autocorrelation Function')
plt.show()
from statsmodels.tsa.seasonal import seasonal_decompose
# Perform seasonal decomposition.
# The period parameter depends on the data's seasonality; with hourly bars
# we try period=12 as a within-day cycle.
result = seasonal_decompose(diff_df1["GARAN-AKBNK"], model='additive', period=12)
# Plot the seasonal decomposition
plt.rcParams.update({'figure.figsize': (30,10)})
result.plot()
plt.show()
ChatGPT prompt: How to do linear regression by using statsmodels?
import statsmodels.api as sm
# Response variable
y = diff_df1["GARAN-AKBNK"].values
# Predictor variable (with a constant added for the intercept term)
X = (diff_df1["timestamp"] - diff_df1["timestamp"].min()).dt.days.values
X = sm.add_constant(X) # Adds a constant term to the predictor
# Linear regression
model = sm.OLS(y, X).fit()
# Make predictions using the model
predictions = model.predict(X)
plt.figure(figsize=(10,6))
# Plot the original data
plt.scatter(diff_df1["timestamp"], y, label='Data', color='blue', s=5, marker='o')
# Plot the regression line
plt.plot(diff_df1["timestamp"], predictions, label='Regression Line', color='red')
# Formatting plot with title and labels
plt.title('Data and Regression Line')
plt.xlabel('Timestamp')
plt.ylabel('GARAN-AKBNK')
plt.legend()
# Show the plot
plt.show()
# Calculate residuals
residuals = model.resid
# Calculate control limits
mean_residual = np.mean(residuals)
std_residual = np.std(residuals)
upper_control_limit = mean_residual + 3 * std_residual
lower_control_limit = mean_residual - 3 * std_residual
# Plot control chart
plt.plot(residuals)
plt.axhline(mean_residual, color='blue', linestyle='--')
plt.axhline(upper_control_limit, color='red', linestyle='--')
plt.axhline(lower_control_limit, color='green', linestyle='--')
plt.title(f'Control Chart for "GARAN" and "AKBNK"')
plt.show()
model.rsquared
0.7696509725770251
data1[data1["short_name"]=="KCHOL"].plot(x="timestamp", y="price")
<Axes: xlabel='timestamp'>
data1[data1["short_name"]=="AKBNK"].plot(x="timestamp", y="price")
<Axes: xlabel='timestamp'>
diff_df2 = merged_df[["timestamp", "KCHOL", "AKBNK"]]
diff_df2 = diff_df2.copy()  # work on a copy to avoid SettingWithCopyWarning
diff_df2["KCHOL-AKBNK"] = diff_df2["KCHOL"] - diff_df2["AKBNK"]
diff_df2
| timestamp | KCHOL | AKBNK | KCHOL-AKBNK | |
|---|---|---|---|---|
| 0 | 2022-12-26 09:00:00+03:00 | 75.8945 | 18.1629 | 57.7316 |
| 1 | 2022-12-26 10:00:00+03:00 | 76.1879 | 18.1992 | 57.9887 |
| 2 | 2022-12-26 11:00:00+03:00 | 76.3346 | 18.1629 | 58.1717 |
| 3 | 2022-12-26 12:00:00+03:00 | 76.2368 | 18.1811 | 58.0557 |
| 4 | 2022-12-26 13:00:00+03:00 | 76.0412 | 18.1448 | 57.8964 |
| ... | ... | ... | ... | ... |
| 2225 | 2023-11-21 09:00:00+03:00 | 144.0000 | 30.6800 | 113.3200 |
| 2226 | 2023-11-21 10:00:00+03:00 | 143.9000 | 30.4800 | 113.4200 |
| 2227 | 2023-11-21 11:00:00+03:00 | 143.2000 | 30.4800 | 112.7200 |
| 2228 | 2023-11-21 12:00:00+03:00 | 143.4000 | 30.5000 | 112.9000 |
| 2229 | 2023-11-21 13:00:00+03:00 | 143.4000 | 30.4600 | 112.9400 |
2230 rows × 4 columns
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# plot_acf/plot_pacf create their own figures, so no extra plt.figure() is needed
plot_acf(diff_df2["KCHOL-AKBNK"], lags=200, alpha=0.05)  # number of lags and confidence level
plt.title('Autocorrelation Function')
plot_pacf(diff_df2["KCHOL-AKBNK"], lags=200, alpha=0.05)
plt.title('Partial Autocorrelation Function')
plt.show()
import statsmodels.api as sm
# Response variable
y = diff_df2["KCHOL-AKBNK"].values
# Predictor variable (with a constant added for the intercept term)
X = (diff_df2["timestamp"] - diff_df2["timestamp"].min()).dt.days.values
X = sm.add_constant(X) # Adds a constant term to the predictor
# Linear regression
model = sm.OLS(y, X).fit()
# Make predictions using the model
predictions = model.predict(X)
# Plot the original data
plt.scatter(diff_df2["timestamp"], y, label='Data', color='blue', s=5, marker='o')
# Plot the regression line
plt.plot(diff_df2["timestamp"], predictions, label='Regression Line', color='red')
# Formatting plot with title and labels
plt.title('Data and Regression Line')
plt.xlabel('Timestamp')
plt.ylabel('KCHOL-AKBNK')
plt.legend()
# Show the plot
plt.show()
# Calculate residuals
residuals = model.resid
# Calculate control limits
mean_residual = np.mean(residuals)
std_residual = np.std(residuals)
upper_control_limit = mean_residual + 3 * std_residual
lower_control_limit = mean_residual - 3 * std_residual
# Plot control chart
plt.plot(residuals)
plt.axhline(mean_residual, color='orange', linestyle='--')
plt.axhline(upper_control_limit, color='red', linestyle='--')
plt.axhline(lower_control_limit, color='green', linestyle='--')
plt.title(f'Control Chart for "KCHOL" and "AKBNK"')
plt.show()
model.rsquared
0.8615573840921081
Effectiveness of the Approach
Since no out-of-control points were observed in the control charts, there were no trading opportunities, so we made no transactions and earned no profit.
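The "no out-of-control points" claim can be checked programmatically. A minimal sketch, with a hypothetical `out_of_control` helper that applies the same k-sigma limits as the charts above:

```python
import numpy as np

def out_of_control(residuals, k: float = 3.0):
    """Return the integer positions of residuals outside the k-sigma limits."""
    r = np.asarray(residuals, dtype=float)
    mean, std = r.mean(), r.std()
    mask = (r > mean + k * std) | (r < mean - k * std)
    return np.flatnonzero(mask)

# In-control residuals produce an empty array, so no trades are triggered
print(out_of_control([0.1, -0.3, 0.2, 0.0, -0.1]))  # → []
```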
We chose to do this part with ARIMA because ARIMA models can accommodate a variety of patterns: constant or varying volatility and seasonal or non-seasonal fluctuations. They are also easy to implement and interpret, since they require only a few parameters.
from statsmodels.tsa.seasonal import seasonal_decompose
# Perform seasonal decomposition.
# The period parameter depends on the data's seasonality; with hourly bars
# we again use period=12 as a within-day cycle.
result = seasonal_decompose(diff_df2["KCHOL-AKBNK"], model='additive', period=12)
# Plot the seasonal decomposition
plt.rcParams.update({'figure.figsize': (30,10)})
result.plot()
plt.show()
!pip install pmdarima
diff_df1.set_index('timestamp', inplace=True)
ChatGPT prompt: Write an ARIMA function in Python and print the residuals.
import pmdarima as pm
# Fit the Auto ARIMA model
model = pm.auto_arima(diff_df1[["GARAN-AKBNK"]], start_p=1, start_q=1,
test='adf', # Use the Augmented Dickey-Fuller test to find optimal 'd'
max_p=3, max_q=3, max_P=3, max_Q=3, # Maximum p and q
m=12, # Frequency of the series
d=None, D=None, # Let the model determine 'd'
seasonal=True, # Seasonality
start_P=0,
trace=True,
error_action='ignore',
suppress_warnings=True,
stepwise=True
)
# Print the summary
print(model.summary())
# Print the residuals
residuals = model.resid()
print("Residuals:")
print(residuals)
# Forecasting (optional)
n_periods = 5
forecast, conf_int = model.predict(n_periods=n_periods, return_conf_int=True)
print(f"Forecast: {forecast}")
print(f"Confidence Intervals: {conf_int}")
Performing stepwise search to minimize aic
ARIMA(1,1,1)(0,0,1)[12] intercept : AIC=-12081.008, Time=1.98 sec
ARIMA(0,1,0)(0,0,0)[12] intercept : AIC=-12069.944, Time=0.18 sec
ARIMA(1,1,0)(1,0,0)[12] intercept : AIC=-12082.611, Time=1.48 sec
ARIMA(0,1,1)(0,0,1)[12] intercept : AIC=-12082.940, Time=0.63 sec
ARIMA(0,1,0)(0,0,0)[12] : AIC=-12071.388, Time=0.11 sec
ARIMA(0,1,1)(0,0,0)[12] intercept : AIC=-12084.912, Time=0.20 sec
ARIMA(0,1,1)(1,0,0)[12] intercept : AIC=-12082.949, Time=0.59 sec
ARIMA(0,1,1)(1,0,1)[12] intercept : AIC=-12080.940, Time=1.60 sec
ARIMA(1,1,1)(0,0,0)[12] intercept : AIC=-12082.982, Time=1.20 sec
ARIMA(0,1,2)(0,0,0)[12] intercept : AIC=-12082.990, Time=1.15 sec
ARIMA(1,1,0)(0,0,0)[12] intercept : AIC=-12084.570, Time=1.02 sec
ARIMA(1,1,2)(0,0,0)[12] intercept : AIC=-12080.905, Time=1.52 sec
ARIMA(0,1,1)(0,0,0)[12] : AIC=-12086.244, Time=0.19 sec
ARIMA(0,1,1)(1,0,0)[12] : AIC=-12084.280, Time=1.39 sec
ARIMA(0,1,1)(0,0,1)[12] : AIC=-12084.281, Time=1.47 sec
ARIMA(0,1,1)(1,0,1)[12] : AIC=-12082.281, Time=2.19 sec
ARIMA(1,1,1)(0,0,0)[12] : AIC=-12084.306, Time=0.76 sec
ARIMA(0,1,2)(0,0,0)[12] : AIC=-12084.314, Time=0.58 sec
ARIMA(1,1,0)(0,0,0)[12] : AIC=-12085.914, Time=0.22 sec
ARIMA(1,1,2)(0,0,0)[12] : AIC=-12082.323, Time=1.10 sec
Best model: ARIMA(0,1,1)(0,0,0)[12]
Total fit time: 19.618 seconds
SARIMAX Results
==============================================================================
Dep. Variable: y No. Observations: 2230
Model: SARIMAX(0, 1, 1) Log Likelihood 6045.122
Date: Mon, 11 Dec 2023 AIC -12086.244
Time: 20:38:24 BIC -12074.825
Sample: 0 HQIC -12082.074
- 2230
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ma.L1 -0.0870 0.015 -5.811 0.000 -0.116 -0.058
sigma2 0.0003 3.75e-06 68.849 0.000 0.000 0.000
===================================================================================
Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 4164.53
Prob(Q): 0.98 Prob(JB): 0.00
Heteroskedasticity (H): 0.66 Skew: 0.35
Prob(H) (two-sided): 0.00 Kurtosis: 9.66
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
Residuals:
timestamp
2022-12-26 09:00:00+03:00 2.383178
2022-12-26 10:00:00+03:00 0.023794
2022-12-26 11:00:00+03:00 -0.001384
2022-12-26 12:00:00+03:00 0.004947
2022-12-26 13:00:00+03:00 0.003690
...
2023-11-21 09:00:00+03:00 0.005982
2023-11-21 10:00:00+03:00 0.013888
2023-11-21 11:00:00+03:00 -0.014232
2023-11-21 12:00:00+03:00 -0.006438
2023-11-21 13:00:00+03:00 -0.003693
Length: 2230, dtype: float64
Forecast: 2230 2.951056
2231 2.951056
2232 2.951056
2233 2.951056
2234 2.951056
dtype: float64
Confidence Intervals: [[2.91957458 2.98253765]
[2.90842664 2.99368558]
[2.89964156 3.00247067]
[2.89215245 3.00995978]
[2.88551356 3.01659867]]
auto_arima selected ARIMA(0,1,1) as the best-fitting model; its AIC value is the lowest of all the candidates.
# Ensure residuals.index is a DateTimeIndex
residuals.index = pd.to_datetime(residuals.index)
start_date = dt.datetime.strptime("2022-12-28", "%Y-%m-%d")
end_date = dt.datetime.strptime("2023-12-24", "%Y-%m-%d")
# Calculate control limits
mean_residual = np.mean(residuals)
std_residual = np.std(residuals)
upper_control_limit1 = mean_residual + 3 * std_residual
lower_control_limit1 = mean_residual - 3 * std_residual
# Plot control chart
plt.figure(figsize=(20,6))
plt.plot(residuals)
plt.axhline(mean_residual, color='orange', linestyle='--')
plt.axhline(upper_control_limit1, color='red', linestyle='--')
plt.axhline(lower_control_limit1, color='green', linestyle='--')
plt.title(f'Control Chart for "GARAN" and "AKBNK"')
plt.ylim(-1,1)
plt.xlim(start_date, end_date)
plt.show()
# Filter the DataFrame for points where residuals exceed the control limits
out_of_control_points = residuals[(residuals > upper_control_limit1) | (residuals < lower_control_limit1)]
out_of_control_points
timestamp
2022-12-26 09:00:00+03:00    2.383178
dtype: float64
# Extract the dates for these points
out_of_control_dates = out_of_control_points.index
out_of_control_dates
DatetimeIndex(['2022-12-26 09:00:00+03:00'], dtype='datetime64[ns, pytz.FixedOffset(180)]', name='timestamp', freq=None)
for time in out_of_control_dates:
out_of_controls = diff_df1[diff_df1.index == time]
print(out_of_controls)
                              GARAN    AKBNK  GARAN-AKBNK
timestamp
2022-12-26 09:00:00+03:00   28.0022  18.1629     2.383178
A control chart with 3-sigma limits has been constructed from the residuals of the ARIMA model. It flags one out-of-control signal for GARAN and AKBNK; that is where we would carry out the pairs trade.
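As a sketch of how such a signal could be turned into a position (the `trade_at_signal` rule and the one-unit notional sizing are illustrative assumptions, not part of the original analysis):

```python
def trade_at_signal(price_a, price_b, residual):
    """Illustrative pairs trade at an out-of-control timestamp.

    A positive residual means the spread (A - B) is unusually wide:
    short one unit of notional in A and buy one in B, expecting the
    spread to revert toward the model. A negative residual does the
    opposite. Returned sizes are in shares, for equal notional value.
    """
    direction = -1 if residual > 0 else 1   # -1: short A / long B
    return {"A_units": direction / price_a, "B_units": -direction / price_b}

# At the single signal found above (illustrative GARAN / AKBNK prices)
print(trade_at_signal(28.0, 18.2, residual=2.38))
```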
diff_df2.set_index('timestamp', inplace=True)
import pmdarima as pm
# Fit the Auto ARIMA model
model = pm.auto_arima(diff_df2[["KCHOL-AKBNK"]], start_p=1, start_q=1,
test='adf', # Use the Augmented Dickey-Fuller test to find optimal 'd'
max_p=3, max_q=3, # Maximum p and q
m=1, # Frequency of the series
d=None, # Let the model determine 'd'
seasonal=True, # Seasonality
start_P=0,
D=0,
trace=True,
error_action='ignore',
suppress_warnings=True,
stepwise=True
)
# Print the summary
print(model.summary())
# Print the residuals
residuals = model.resid()
print("Residuals:")
print(residuals)
# Forecasting (optional)
n_periods = 5
forecast, conf_int = model.predict(n_periods=n_periods, return_conf_int=True)
print(f"Forecast: {forecast}")
print(f"Confidence Intervals: {conf_int}")
Performing stepwise search to minimize aic
ARIMA(1,1,1)(0,0,0)[0] intercept : AIC=5640.213, Time=0.64 sec
ARIMA(0,1,0)(0,0,0)[0] intercept : AIC=5645.641, Time=0.34 sec
ARIMA(1,1,0)(0,0,0)[0] intercept : AIC=5638.263, Time=0.15 sec
ARIMA(0,1,1)(0,0,0)[0] intercept : AIC=5638.211, Time=0.36 sec
ARIMA(0,1,0)(0,0,0)[0] : AIC=5645.499, Time=0.18 sec
ARIMA(0,1,2)(0,0,0)[0] intercept : AIC=5640.211, Time=0.86 sec
ARIMA(1,1,2)(0,0,0)[0] intercept : AIC=5642.094, Time=1.05 sec
ARIMA(0,1,1)(0,0,0)[0] : AIC=5638.342, Time=0.18 sec
Best model: ARIMA(0,1,1)(0,0,0)[0] intercept
Total fit time: 3.805 seconds
SARIMAX Results
==============================================================================
Dep. Variable: y No. Observations: 2230
Model: SARIMAX(0, 1, 1) Log Likelihood -2816.106
Date: Mon, 11 Dec 2023 AIC 5638.211
Time: 20:38:30 BIC 5655.339
Sample: 0 HQIC 5644.466
- 2230
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
intercept 0.0248 0.017 1.454 0.146 -0.009 0.058
ma.L1 -0.0651 0.016 -4.120 0.000 -0.096 -0.034
sigma2 0.7326 0.010 76.487 0.000 0.714 0.751
===================================================================================
Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 6730.41
Prob(Q): 1.00 Prob(JB): 0.00
Heteroskedasticity (H): 2.00 Skew: -0.01
Prob(H) (two-sided): 0.00 Kurtosis: 11.51
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
Residuals:
timestamp
2022-12-26 09:00:00+03:00 57.706845
2022-12-26 10:00:00+03:00 0.232347
2022-12-26 11:00:00+03:00 0.173300
2022-12-26 12:00:00+03:00 -0.129479
2022-12-26 13:00:00+03:00 -0.192481
...
2023-11-21 09:00:00+03:00 0.553345
2023-11-21 10:00:00+03:00 0.111250
2023-11-21 11:00:00+03:00 -0.717516
2023-11-21 12:00:00+03:00 0.108556
2023-11-21 13:00:00+03:00 0.022308
Length: 2230, dtype: float64
Forecast: 2230 112.963304
2231 112.988059
2232 113.012815
2233 113.037570
2234 113.062326
dtype: float64
Confidence Intervals: [[111.28568574 114.64092201]
[110.69144054 115.28467814]
[110.23171552 115.79391409]
[109.84467683 116.23046371]
[109.50499192 116.61965954]]
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: FutureWarning: No supported index is available. In the next version, calling this method in a model without a supported index will result in an exception.
The stepwise search selected ARIMA(0,1,1) with intercept as the best-fitting model; its AIC (5638.211) is the lowest among all the candidates tried.
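For intuition, `auto_arima` minimizes the Akaike Information Criterion, AIC = 2k − 2 ln L, where k is the number of estimated parameters and L the maximized likelihood. A small sketch on synthetic residuals (hypothetical data, not the project's series) shows why tighter residuals win:

```python
import numpy as np

# AIC = 2k - 2*ln(L). For a Gaussian model with n residuals e and MLE
# variance s2 = mean(e^2), the maximized log-likelihood is
# ln(L) = -(n/2) * (ln(2*pi*s2) + 1).
def gaussian_aic(residuals, k):
    e = np.asarray(residuals)
    n = e.size
    s2 = np.mean(e ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * s2) + 1)
    return 2 * k - 2 * loglik

rng = np.random.default_rng(0)
tight = rng.normal(0, 0.5, 500)  # residuals of a better-fitting model
loose = rng.normal(0, 2.0, 500)  # residuals of a worse-fitting model
# With the same parameter count, smaller residual variance gives lower AIC.
print(gaussian_aic(tight, k=3) < gaussian_aic(loose, k=3))  # True
```

The 2k term penalizes extra parameters, which is why the stepwise search did not simply keep adding AR and MA terms.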
# Ensure residuals.index is a DateTimeIndex
residuals.index = pd.to_datetime(residuals.index)
start_date = dt.datetime.strptime("2022-12-28", "%Y-%m-%d")
end_date = dt.datetime.strptime("2023-12-24", "%Y-%m-%d")
# Calculate control limits
mean_residual = np.mean(residuals)
std_residual = np.std(residuals)
upper_control_limit2 = mean_residual + 3 * std_residual
lower_control_limit2 = mean_residual - 3 * std_residual
# Plot control chart
plt.figure(figsize=(20,6))
plt.plot(residuals)
plt.axhline(mean_residual, color='orange', linestyle='--')
plt.axhline(upper_control_limit2, color='red', linestyle='--')
plt.axhline(lower_control_limit2, color='green', linestyle='--')
plt.title('Control Chart for "KCHOL" and "AKBNK"')
plt.ylim(-8,8)
plt.xlim(start_date, end_date)
plt.show()
# Filter the DataFrame for points where residuals exceed the control limits
out_of_control_points = residuals[(residuals > upper_control_limit2) | (residuals < lower_control_limit2)]
out_of_control_points
timestamp
2022-12-26 09:00:00+03:00    57.706845
2023-05-15 09:00:00+03:00    -7.800859
2023-05-16 09:00:00+03:00    -6.996478
2023-08-24 17:00:00+03:00    -4.925833
dtype: float64
# Extract the dates for these points
out_of_control_dates = out_of_control_points.index
out_of_control_dates
DatetimeIndex(['2022-12-26 09:00:00+03:00', '2023-05-15 09:00:00+03:00',
'2023-05-16 09:00:00+03:00', '2023-08-24 17:00:00+03:00'],
dtype='datetime64[ns, pytz.FixedOffset(180)]', name='timestamp', freq=None)
for time in out_of_control_dates:
out_of_controls = diff_df2[diff_df2.index == time]
print(out_of_controls)
KCHOL AKBNK KCHOL-AKBNK
timestamp
2022-12-26 09:00:00+03:00 75.8945 18.1629 57.7316
KCHOL AKBNK KCHOL-AKBNK
timestamp
2023-05-15 09:00:00+03:00 87.2 17.37 69.83
KCHOL AKBNK KCHOL-AKBNK
timestamp
2023-05-16 09:00:00+03:00 78.5 15.64 62.86
KCHOL AKBNK KCHOL-AKBNK
timestamp
2023-08-24 17:00:00+03:00 137.2 30.18 107.02
A control chart with 3-sigma limits has been constructed from the residuals of the ARIMA model. It flags 4 out-of-control signals for “KCHOL” and “AKBNK”; those are the points where we will do pairs trading.
diff_df1.loc[:, "residuals"] = diff_df1["GARAN-AKBNK"]
diff_df2.loc[:, "residuals"] = diff_df2["KCHOL-AKBNK"]
The ChatGPT prompt for this function was quite involved, so we essentially fine-tuned its response by hand.
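Note that the `residuals` column here stores the raw price spread rather than the ARIMA residuals. If one instead wanted the model residuals in the frame, a pandas `Series` with a matching timestamp index can be assigned via `.loc`, which aligns on the index and avoids the `SettingWithCopyWarning` (the values below are hypothetical):

```python
import pandas as pd

# Hypothetical spread frame and model residuals sharing the same timestamps.
idx = pd.date_range("2022-12-26 09:00", periods=3, freq="h")
diff_df = pd.DataFrame({"KCHOL-AKBNK": [57.73, 57.90, 57.85]}, index=idx)
model_resid = pd.Series([57.71, 0.23, 0.17], index=idx)  # e.g. from model.resid()

# .loc assignment aligns on the index rather than on position
diff_df.loc[:, "residuals"] = model_resid
print(diff_df)
```

This keeps every downstream comparison (spread vs. residual-based control limits) operating on the same kind of quantity.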
Trading Strategy: When the residuals of the differences fall outside the control limits, the relatively outperforming stock is sold and the relatively underperforming stock is bought. Suppose the difference is calculated between stocks A and B as A minus B. If there is an out-of-control signal below the LCL, we buy stock A and sell stock B: a value of “A minus B” below the LCL means A has been underperforming relative to B, so A is relatively cheap and B relatively expensive. Conversely, if there is an out-of-control signal above the UCL, we sell stock A and buy stock B.
To profit from this strategy on “A minus B”, whenever we get an out-of-control signal below the LCL we buy A and sell B, and we hold that position until we get a signal above the UCL, at which point we sell A and buy B. The strategy is profitable when the spread reverts toward its mean, because each leg is then sold at a higher price than it was bought; if the spread fails to revert, the position can also lose money.
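As a worked toy example of this logic (hypothetical prices, chosen to resemble the numbers seen above): the spread A − B breaks the UCL, so A is sold and the proceeds go into B; the position then gains if B later appreciates relative to A.

```python
# Hypothetical prices at the out-of-control timestamp (signal above the UCL).
price_A, price_B = 28.00, 18.16
cash, shares_A, shares_B = 10000.0, 5, 5

# Signal above UCL: sell all of A, put all available cash into B.
cash += shares_A * price_A            # proceeds from selling A
shares_A = 0
bought = int(cash // price_B)         # whole shares only
cash -= bought * price_B
shares_B += bought

# Later the spread narrows because B appreciates; mark the position to market.
price_B_later = 30.46
final_value = cash + shares_B * price_B_later
print(shares_B, round(final_value, 2))
```

The gain comes entirely from B's subsequent appreciation relative to the swap price, which is exactly the mean-reversion bet described above.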
def trade_on_residuals(df, upper_control_limit, lower_control_limit, initial_cash, first_comp="A", second_comp="B"):
    # Start with 5 shares of each stock plus the initial cash
    holdings = {first_comp: 5, second_comp: 5}
cash = initial_cash
transactions = [] # To record the transactions
# Iterate over the DataFrame rows
for index, row in df.iterrows():
residual = row['residuals']
price_A = row[first_comp]
price_B = row[second_comp]
# Check if residual exceeds the upper control limit
if residual > upper_control_limit:
# Sell A and buy B
if holdings[first_comp] > 0:
# Calculate number of shares to sell
shares_to_sell = holdings[first_comp]
# Update cash and holdings after selling A
cash += shares_to_sell * price_A
holdings[first_comp] -= shares_to_sell
transactions.append({'action': 'sell', 'stock': first_comp, 'shares': shares_to_sell, 'price': price_A})
# Use all available cash to buy B
shares_to_buy = cash // price_B
cash -= shares_to_buy * price_B
holdings[second_comp] += shares_to_buy
transactions.append({'action': 'buy', 'stock': second_comp, 'shares': shares_to_buy, 'price': price_B})
# Check if residual exceeds the lower control limit
elif residual < lower_control_limit:
# Sell B and buy A
if holdings[second_comp] > 0:
# Calculate number of shares to sell
shares_to_sell = holdings[second_comp]
# Update cash and holdings after selling B
cash += shares_to_sell * price_B
holdings[second_comp] -= shares_to_sell
transactions.append({'action': 'sell', 'stock': second_comp, 'shares': shares_to_sell, 'price': price_B})
# Use all available cash to buy A
shares_to_buy = cash // price_A
cash -= shares_to_buy * price_A
holdings[first_comp] += shares_to_buy
transactions.append({'action': 'buy', 'stock': first_comp, 'shares': shares_to_buy, 'price': price_A})
transactions_df = pd.DataFrame(transactions)
# Get the last price of stock A and B
last_price_A = df.iloc[-1][first_comp]
last_price_B = df.iloc[-1][second_comp]
# Calculate the value of the holdings at the last available prices
value_of_A = holdings[first_comp] * last_price_A
value_of_B = holdings[second_comp] * last_price_B
net_final_value = cash + value_of_A + value_of_B
# Return the net final value along with the other information
return holdings, cash, transactions_df, net_final_value
# Examples
initial_cash = 10000 # Initial cash amount
holdings, remaining_cash, transactions, net_final_value = trade_on_residuals(
diff_df2,
upper_control_limit2,
lower_control_limit2,
initial_cash,
"KCHOL",
"AKBNK"
)
print(f"Final Holdings: {holdings}")
print(f"Remaining Cash: {remaining_cash}")
print(f"Net Final Value: {net_final_value}")
transactions
Final Holdings: {'KCHOL': 0, 'AKBNK': 576.0}
Remaining Cash: 8.456599999999526
Net Final Value: 17553.416599999997
|   | action | stock | shares | price   |
|---|--------|-------|--------|---------|
| 0 | sell   | KCHOL | 5.0    | 75.8945 |
| 1 | buy    | AKBNK | 571.0  | 18.1629 |
# Example: the GARAN/AKBNK pair, starting from initial_cash = 10000
holdings2, remaining_cash2, transactions2, net_final_value2 = trade_on_residuals(
diff_df1,
upper_control_limit1,
lower_control_limit1,
initial_cash,
"GARAN",
"AKBNK"
)
print(f"Final Holdings: {holdings2}")
print(f"Remaining Cash: {remaining_cash2}")
print(f"Net Final Value: {net_final_value2}")
transactions2
Final Holdings: {'GARAN': 0, 'AKBNK': 563.0}
Remaining Cash: 5.112800000000789
Net Final Value: 17154.0928
|   | action | stock | shares | price   |
|---|--------|-------|--------|---------|
| 0 | sell   | GARAN | 5.0    | 28.0022 |
| 1 | buy    | AKBNK | 558.0  | 18.1629 |
# Example: the GARAN/AKBNK pair again, starting from initial_cash = 10000
holdings3, remaining_cash3, transactions3, net_final_value3 = trade_on_residuals(
diff_df1,
upper_control_limit1,
lower_control_limit1,
initial_cash,
"GARAN",
"AKBNK"
)
print(f"Final Holdings: {holdings3}")
print(f"Remaining Cash: {remaining_cash3}")
print(f"Net Final Value: {net_final_value3}")
transactions3
Final Holdings: {'GARAN': 0, 'AKBNK': 563.0}
Remaining Cash: 5.112800000000789
Net Final Value: 17154.0928
|   | action | stock | shares | price   |
|---|--------|-------|--------|---------|
| 0 | sell   | GARAN | 5.0    | 28.0022 |
| 1 | buy    | AKBNK | 558.0  | 18.1629 |
# Example: the KCHOL/AKBNK pair again, starting from initial_cash = 10000
holdings4, remaining_cash4, transactions4, net_final_value4 = trade_on_residuals(
diff_df2,
upper_control_limit2,
lower_control_limit2,
initial_cash,
"KCHOL",
"AKBNK"
)
print(f"Final Holdings: {holdings4}")
print(f"Remaining Cash: {remaining_cash4}")
print(f"Net Final Value: {net_final_value4}")
transactions4
Final Holdings: {'KCHOL': 0, 'AKBNK': 576.0}
Remaining Cash: 8.456599999999526
Net Final Value: 17553.416599999997
|   | action | stock | shares | price   |
|---|--------|-------|--------|---------|
| 0 | sell   | KCHOL | 5.0    | 75.8945 |
| 1 | buy    | AKBNK | 571.0  | 18.1629 |
# Example: the GARAN/AKBNK pair again, starting from initial_cash = 10000
holdings5, remaining_cash5, transactions5, net_final_value5 = trade_on_residuals(
diff_df1,
upper_control_limit1,
lower_control_limit1,
initial_cash,
"GARAN",
"AKBNK"
)
print(f"Final Holdings: {holdings5}")
print(f"Remaining Cash: {remaining_cash5}")
print(f"Net Final Value: {net_final_value5}")
transactions5
Final Holdings: {'GARAN': 0, 'AKBNK': 563.0}
Remaining Cash: 5.112800000000789
Net Final Value: 17154.0928
|   | action | stock | shares | price   |
|---|--------|-------|--------|---------|
| 0 | sell   | GARAN | 5.0    | 28.0022 |
| 1 | buy    | AKBNK | 558.0  | 18.1629 |
# Example: the KCHOL/AKBNK pair again, starting from initial_cash = 10000
holdings6, remaining_cash6, transactions6, net_final_value6 = trade_on_residuals(
diff_df2,
upper_control_limit2,
lower_control_limit2,
initial_cash,
"KCHOL",
"AKBNK"
)
print(f"Final Holdings: {holdings6}")
print(f"Remaining Cash: {remaining_cash6}")
print(f"Net Final Value: {net_final_value6}")
transactions6
Final Holdings: {'KCHOL': 0, 'AKBNK': 576.0}
Remaining Cash: 8.456599999999526
Net Final Value: 17553.416599999997
|   | action | stock | shares | price   |
|---|--------|-------|--------|---------|
| 0 | sell   | KCHOL | 5.0    | 75.8945 |
| 1 | buy    | AKBNK | 571.0  | 18.1629 |
In conclusion, using an ARIMA model we observed that out-of-control signals on a control chart can indicate when to buy and sell two highly correlated stocks. Because linear regression is a simple model, it fails to capture deviations from the expected relationship between the stocks. By applying control charts to pairs trading, we can profit from the relationship between highly correlated stocks.